🎬 IMDB Movies Rating Prediction 🎬

📚 Importing Libraries 📚

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
colors = ['#235E72']

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import accuracy_score, mean_absolute_error, mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

⏳ Loading the dataset ⏳

In [2]:
imdb = pd.read_csv('imdb_movies_india.csv', encoding='latin-1')

🧠 Understanding of data 🧠

In [3]:
imdb.head()
Out[3]:
Name Year Duration Genre Rating Votes Director Actor 1 Actor 2 Actor 3
0 NaN NaN Drama NaN NaN J.S. Randhawa Manmauji Birbal Rajendra Bhatia
1 #Gadhvi (He thought he was Gandhi) (2019) 109 min Drama 7.0 8 Gaurav Bakshi Rasika Dugal Vivek Ghamande Arvind Jangid
2 #Homecoming (2021) 90 min Drama, Musical NaN NaN Soumyajit Majumdar Sayani Gupta Plabita Borthakur Roy Angana
3 #Yaaram (2019) 110 min Comedy, Romance 4.4 35 Ovais Khan Prateik Ishita Raj Siddhant Kapoor
4 ...And Once Again (2010) 105 min Drama NaN NaN Amol Palekar Rajat Kapoor Rituparna Sengupta Antara Mali
In [4]:
imdb.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB

🧹 Data Cleaning 🧹

In [5]:
# Checking null values

imdb.isna().sum()
Out[5]:
Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64
In [6]:
# Locating rows where every column from Year to Actor 2 (positions 1 to 8) is missing

nulls = imdb[imdb.iloc[:, 1:9].isna().all(axis=1)]
nulls.head()
Out[6]:
Name Year Duration Genre Rating Votes Director Actor 1 Actor 2 Actor 3
1836 Bang Bang Reloaded NaN NaN NaN NaN NaN NaN NaN NaN NaN
1920 Battle of bittora NaN NaN NaN NaN NaN NaN NaN NaN NaN
2653 Campus NaN NaN NaN NaN NaN NaN NaN NaN NaN
3403 Dancing Dad NaN NaN NaN NaN NaN NaN NaN NaN NaN
3807 Dial 100 NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [7]:
#Checking if there are any typos

for col in imdb.select_dtypes(include = "object"):
    print(f"Name of Column: {col}")
    print(imdb[col].unique())
    print('\n', '-'*60, '\n')
Name of Column: Name
[' ' '#Gadhvi (He thought he was Gandhi)' '#Homecoming' ... 'Zulmi Raj'
 'Zulmi Shikari' 'Zulm-O-Sitam']

 ------------------------------------------------------------ 

Name of Column: Year
[nan '(2019)' '(2021)' '(2010)' '(1997)' '(2005)' '(2008)' '(2012)'
 '(2014)' '(2004)' '(2016)' '(1991)' '(1990)' '(2018)' '(1987)' '(1948)'
 '(1958)' '(2017)' '(2020)' '(2009)' '(2002)' '(1993)' '(1946)' '(1994)'
 '(2007)' '(2013)' '(2003)' '(1998)' '(1979)' '(1951)' '(1956)' '(1974)'
 '(2015)' '(2006)' '(1981)' '(1985)' '(2011)' '(2001)' '(1967)' '(1988)'
 '(1995)' '(1959)' '(1996)' '(1970)' '(1976)' '(2000)' '(1999)' '(1973)'
 '(1968)' '(1943)' '(1953)' '(1986)' '(1983)' '(1989)' '(1982)' '(1977)'
 '(1957)' '(1950)' '(1992)' '(1969)' '(1975)' '(1947)' '(1972)' '(1971)'
 '(1935)' '(1978)' '(1960)' '(1944)' '(1963)' '(1940)' '(1984)' '(1934)'
 '(1955)' '(1936)' '(1980)' '(1966)' '(1949)' '(1962)' '(1964)' '(1952)'
 '(1933)' '(1942)' '(1939)' '(1954)' '(1945)' '(1961)' '(1965)' '(1938)'
 '(1941)' '(1931)' '(1937)' '(2022)' '(1932)' '(1923)' '(1915)' '(1928)'
 '(1922)' '(1917)' '(1913)' '(1930)' '(1926)' '(1914)' '(1924)']

 ------------------------------------------------------------ 

Name of Column: Duration
[nan '109 min' '90 min' '110 min' '105 min' '147 min' '142 min' '59 min'
 '82 min' '116 min' '96 min' '120 min' '161 min' '166 min' '102 min'
 '87 min' '132 min' '66 min' '146 min' '112 min' '168 min' '158 min'
 '126 min' '94 min' '138 min' '124 min' '144 min' '157 min' '136 min'
 '107 min' '113 min' '80 min' '122 min' '149 min' '148 min' '130 min'
 '121 min' '188 min' '115 min' '103 min' '114 min' '170 min' '100 min'
 '99 min' '140 min' '128 min' '93 min' '125 min' '145 min' '75 min'
 '111 min' '134 min' '85 min' '104 min' '92 min' '137 min' '127 min'
 '150 min' '119 min' '135 min' '86 min' '76 min' '70 min' '72 min'
 '151 min' '95 min' '52 min' '89 min' '143 min' '177 min' '117 min'
 '123 min' '154 min' '88 min' '175 min' '153 min' '78 min' '139 min'
 '133 min' '101 min' '180 min' '60 min' '46 min' '164 min' '162 min'
 '171 min' '160 min' '152 min' '62 min' '163 min' '165 min' '141 min'
 '210 min' '129 min' '156 min' '240 min' '172 min' '155 min' '118 min'
 '167 min' '106 min' '193 min' '57 min' '108 min' '45 min' '195 min'
 '174 min' '81 min' '178 min' '58 min' '184 min' '97 min' '98 min'
 '131 min' '176 min' '169 min' '77 min' '91 min' '84 min' '173 min'
 '74 min' '67 min' '181 min' '300 min' '79 min' '65 min' '48 min'
 '183 min' '159 min' '83 min' '68 min' '49 min' '201 min' '64 min'
 '186 min' '50 min' '69 min' '207 min' '55 min' '61 min' '185 min'
 '187 min' '216 min' '63 min' '54 min' '198 min' '51 min' '71 min'
 '73 min' '218 min' '191 min' '321 min' '199 min' '53 min' '56 min'
 '179 min' '47 min' '206 min' '190 min' '211 min' '247 min' '213 min'
 '223 min' '2 min' '189 min' '224 min' '202 min' '255 min' '197 min'
 '182 min' '214 min' '208 min' '21 min' '200 min' '192 min' '37 min'
 '261 min' '238 min' '204 min' '235 min' '298 min' '217 min' '250 min']

 ------------------------------------------------------------ 

Name of Column: Genre
['Drama' 'Drama, Musical' 'Comedy, Romance' 'Comedy, Drama, Musical'
 'Drama, Romance, War' 'Documentary' 'Horror, Mystery, Thriller'
 'Action, Crime, Thriller' 'Horror' 'Horror, Romance, Thriller'
 'Comedy, Drama, Romance' 'Thriller' 'Comedy, Drama' nan
 'Comedy, Drama, Fantasy' 'Comedy, Drama, Family' 'Crime, Drama, Mystery'
 'Horror, Thriller' 'Biography' 'Comedy, Horror' 'Action'
 'Drama, Horror, Mystery' 'Comedy' 'Action, Thriller' 'Drama, History'
 'Drama, History, Sport' 'Horror, Mystery, Romance' 'Horror, Mystery'
 'Drama, Horror, Romance' 'Action, Drama, History' 'Action, Drama, War'
 'Comedy, Family' 'Adventure, Horror, Mystery' 'Action, Sci-Fi'
 'Crime, Mystery, Thriller' 'War' 'Sport' 'Biography, Drama, History'
 'Horror, Romance' 'Crime, Drama' 'Drama, Romance' 'Adventure, Drama'
 'Comedy, Mystery, Thriller' 'Action, Crime, Drama' 'Crime, Thriller'
 'Horror, Sci-Fi, Thriller' 'Crime, Drama, Thriller'
 'Drama, Mystery, Thriller' 'Drama, Sport' 'Drama, Family, Musical'
 'Action, Comedy' 'Comedy, Thriller' 'Action, Adventure, Fantasy'
 'Drama, Romance, Thriller' 'Action, Drama' 'Drama, Horror, Musical'
 'Action, Biography, Drama' 'Adventure, Comedy, Drama' 'Mystery'
 'Action, Fantasy, Mystery' 'Adventure, Drama, Mystery'
 'Mystery, Thriller' 'Adventure' 'Drama, Musical, Thriller'
 'Comedy, Crime, Drama' 'Musical, Romance' 'Documentary, Music'
 'Documentary, History, Music' 'Drama, Fantasy, Mystery'
 'Drama, Family, Sport' 'Drama, Thriller' 'Documentary, Biography'
 'Action, Adventure, Comedy' 'Romance' 'Comedy, Drama, Music'
 'Comedy, Horror, Mystery' 'Musical' 'Musical, Romance, Drama'
 'Family, Romance' 'Action, Sci-Fi, Thriller' 'Action, Drama, Romance'
 'Mystery, Romance' 'Fantasy' 'Family' 'Drama, Family'
 'Action, Comedy, Drama' 'Action, Drama, Thriller'
 'Drama, Horror, Thriller' 'Drama, Musical, Romance' 'Comedy, Sci-Fi'
 'Action, Romance' 'Action, Crime' 'Action, Drama, Crime'
 'Drama, Family, Music' 'Action, Mystery, Thriller'
 'Action, Drama, Family' 'Action, Mystery' 'Drama, History, Romance'
 'Crime, Drama, Music' 'Sci-Fi' 'Animation' 'Crime, Mystery, Romance'
 'Action, Adventure, Romance' 'Music, Romance' 'Action, Comedy, Crime'
 'Comedy, Family, Fantasy' 'Romance, Drama' 'Drama, Family, Romance'
 'Romance, Drama, Family' 'Musical, Romance, Thriller'
 'Family, Musical, Romance' 'Action, Drama, Fantasy' 'Family, Drama'
 'Crime, Drama, Romance' 'Musical, Drama, Romance' 'Drama, Music, Musical'
 'Drama, Mystery' 'Adventure, Comedy, Romance' 'Crime, Drama, Horror'
 'Family, Music, Musical' 'Action, Musical, Thriller'
 'Action, Romance, Thriller' 'Romance, Thriller' 'Drama, Music'
 'Crime, Drama, Musical' 'Action, Crime, Mystery'
 'Action, Adventure, Thriller' 'Comedy, Romance, Sci-Fi' 'Crime'
 'Action, Drama, Mystery' 'Action, Comedy, Thriller' 'Biography, Drama'
 'Action, Comedy, Fantasy' 'Drama, Family, Horror'
 'Action, Adventure, Family' 'Documentary, Biography, Musical'
 'Action, Drama, Musical' 'Adventure, Thriller' 'Crime, Mystery'
 'Drama, Crime' 'Drama, Fantasy, Romance' 'Comedy, Romance, Thriller'
 'Musical, Comedy, Drama' 'Biography, History, War'
 'Action, Comedy, Romance' 'Drama, History, Musical'
 'Action, Crime, Horror' 'Adventure, Fantasy' 'Adventure, Drama, Fantasy'
 'Adventure, Fantasy, Romance' 'Action, Adventure, Drama'
 'Action, Adventure' 'Comedy, Crime' 'Crime, Drama, Fantasy'
 'Adventure, Drama, Romance' 'History' 'Drama, Fantasy, Thriller'
 'Musical, Fantasy' 'Documentary, Thriller' 'Mystery, Romance, Musical'
 'Family, Drama, Romance' 'History, Musical, Romance'
 'Musical, Drama, Crime' 'Adventure, Crime, Romance'
 'Musical, Thriller, Mystery' 'Drama, Comedy' 'Biography, Drama, Romance'
 'Biography, Music' 'Biography, Drama, Music' 'Drama, Sci-Fi'
 'Drama, Family, Thriller' 'Comedy, Musical, Romance'
 'Drama, Family, Comedy' 'Action, Thriller, Romance'
 'Animation, Adventure' 'Action, Crime, Musical' 'Action, Crime, Romance'
 'Animation, Action, Adventure' 'Action, Drama, Sport' 'Comedy, History'
 'Documentary, History' 'Drama, Comedy, Family' 'Action, Adventure, Crime'
 'Documentary, Biography, Music' 'Comedy, Musical'
 'Biography, Crime, Thriller' 'Adventure, Mystery, Thriller'
 'Biography, Drama, Sport' 'Action, Comedy, Musical'
 'Mystery, Romance, Thriller' 'Action, Adventure, Musical'
 'Crime, Musical, Mystery' 'Action, Thriller, Crime'
 'Adventure, Comedy, Crime' 'Comedy, Horror, Musical' 'Adventure, Family'
 'Family, Thriller' 'Drama, Action, Crime' 'Drama, War'
 'Action, Drama, Adventure' 'Adventure, Fantasy, History'
 'Fantasy, Musical' 'Comedy, Drama, Thriller' 'Drama, Fantasy'
 'Musical, Drama' 'Action, Drama, Horror' 'Biography, Crime, Drama'
 'Action, Drama, Music' 'Adventure, Drama, Family'
 'Drama, Romance, Musical' 'Comedy, Musical, Drama'
 'Adventure, Comedy, Musical' 'Crime, Drama, Family'
 'Thriller, Musical, Mystery' 'Documentary, Adventure, Crime'
 'Drama, Action, Horror' 'Adventure, Crime, Drama'
 'Documentary, Biography, Sport' 'Crime, Fantasy, Mystery'
 'Documentary, Biography, Drama' 'Action, Fantasy, Thriller'
 'Adventure, Drama, History' 'Animation, Drama, History'
 'Comedy, Horror, Thriller' 'Drama, Family, History' 'Animation, History'
 'Biography, Drama, Musical' 'Music' 'Family, Comedy' 'Adventure, Mystery'
 'Family, Fantasy' 'Documentary, History, News' 'Drama, Mystery, Romance'
 'Comedy, Fantasy' 'Action, Crime, Family' 'Drama, Musical, Mystery'
 'Action, Thriller, Mystery' 'Drama, Family, Fantasy' 'Action, Family'
 'Action, Adventure, Mystery' 'Horror, Fantasy' 'Comedy, Action'
 'Adventure, Romance' 'Drama, Adventure' 'Animation, Drama, Romance'
 'Comedy, Crime, Romance' 'Adventure, Comedy' 'Comedy, Drama, Sport'
 'Documentary, Crime, History' 'Musical, Mystery, Drama'
 'Adventure, Drama, Sci-Fi' 'Action, Romance, Western'
 'Comedy, Fantasy, Romance' 'Animation, Action, Comedy'
 'Drama, Fantasy, Sci-Fi' 'Drama, Horror' 'Family, Drama, Comedy'
 'Action, Adventure, History' 'Comedy, Family, Romance'
 'Biography, History' 'Animation, Family' 'Drama, Fantasy, History'
 'Animation, Adventure, Fantasy' 'Adventure, Comedy, Family'
 'Drama, History, War' 'Animation, Drama, Fantasy'
 'Action, Musical, Romance' 'Crime, Action, Drama'
 'Comedy, Romance, Musical' 'Fantasy, Drama' 'Musical, Action, Crime'
 'Documentary, Drama' 'Action, Horror, Thriller' 'Action, Horror, Sci-Fi'
 'Mystery, Sci-Fi, Thriller' 'Biography, Family' 'Drama, Action, Comedy'
 'Drama, Music, Romance' 'Action, Biography, Crime'
 'Adventure, Drama, Musical' 'Family, Music, Romance'
 'Fantasy, Mystery, Romance' 'Drama, Crime, Family'
 'Drama, Family, Action' 'Romance, Comedy, Drama'
 'Animation, Adventure, Comedy' 'Sci-Fi, Thriller'
 'Romance, Family, Drama' 'Action, Family, Thriller'
 'Adventure, Crime, Thriller' 'Drama, Romance, Sport'
 'Comedy, Crime, Mystery' 'Adventure, Comedy, Mystery' 'Action, Fantasy'
 'Comedy, Mystery' 'Animation, Adventure, Family'
 'Adventure, Drama, Music' 'Biography, Drama, War'
 'Documentary, Comedy, Drama' 'Musical, Drama, Family'
 'Animation, Comedy, Drama' 'Fantasy, Musical, Drama'
 'Adventure, Crime, Mystery' 'Comedy, Drama, Mystery' 'Documentary, News'
 'Drama, Musical, Family' 'Action, Romance, Drama'
 'Comedy, Crime, Thriller' 'Action, Musical' 'Action, History'
 'Action, Comedy, Mystery' 'Drama, Family, Mystery'
 'Adventure, Drama, Thriller' 'Documentary, Reality-TV'
 'Action, Fantasy, Horror' 'Drama, History, Thriller'
 'Documentary, Family' 'Documentary, Biography, Family' 'Comedy, Sport'
 'Animation, Comedy, Family' 'Crime, Romance, Thriller'
 'Comedy, Musical, Action' 'Action, Mystery, Sci-Fi'
 'Comedy, Crime, Musical' 'Drama, Adventure, Action' 'History, Romance'
 'Reality-TV' 'Fantasy, History' 'Family, Drama, Thriller'
 'Musical, Mystery, Thriller' 'Musical, Comedy, Romance'
 'Musical, Action, Drama' 'Action, Musical, War' 'Romance, Comedy'
 'Horror, Crime, Thriller' 'Crime, Drama, History' 'Comedy, Drama, Horror'
 'Crime, Horror, Thriller' 'Animation, Comedy' 'Romance, Action, Crime'
 'Musical, Thriller' 'Action, Romance, Comedy' 'Comedy, Family, Musical'
 'Horror, Drama, Mystery' 'Thriller, Mystery, Family'
 'Comedy, Drama, Sci-Fi' 'Documentary, Adventure'
 'Documentary, Biography, Crime' 'Musical, Action' 'Musical, Mystery'
 'Action, Crime, Sci-Fi' 'Action, Horror, Mystery' 'Fantasy, Horror'
 'Adventure, Family, Fantasy' 'Fantasy, Sci-Fi' 'Comedy, War'
 'Romance, Action, Drama' 'Musical, Family, Romance'
 'Romance, Drama, Action' 'Family, Comedy, Drama' 'Comedy, Music, Romance'
 'Comedy, Family, Sci-Fi' 'Action, Drama, Western'
 'Adventure, Romance, Thriller' 'Biography, Comedy, Drama'
 'Action, Mystery, Romance' 'Romance, Sport' 'Crime, Romance'
 'Action, Thriller, Western' 'Crime, Musical, Romance'
 'Romance, Thriller, Mystery' 'Drama, Crime, Mystery'
 'Biography, Drama, Family' 'Action, Family, Mystery'
 'Comedy, Mystery, Romance' 'Drama, Thriller, Action' 'Documentary, Short'
 'Documentary, Western' 'Musical, Family, Drama' 'Action, Family, Musical'
 'Animation, Family, Musical' 'Drama, Fantasy, Horror'
 'Action, Adventure, Sci-Fi' 'Drama, Action, Musical'
 'Drama, Musical, Sport' 'Action, Comedy, Horror'
 'Drama, Fantasy, Musical' 'Action, Fantasy, Musical' 'Animation, Action'
 'Comedy, Music' 'Documentary, Drama, Romance' 'Drama, Music, Thriller'
 'Fantasy, Musical, Mystery' 'Drama, Fantasy, War' 'Action, War'
 'Action, Adventure, War' 'Horror, Musical' 'Fantasy, Mystery, Thriller'
 'Adventure, Biography, Drama' 'Family, Romance, Sci-Fi'
 'Drama, Romance, Family' 'Animation, Adventure, Drama'
 'Family, Romance, Drama' 'Animation, Action, Sci-Fi'
 'Adventure, Comedy, Fantasy' 'Comedy, Crime, Family'
 'Horror, Musical, Thriller' 'Biography, Drama, Thriller' 'Drama, Western'
 'Romance, Sci-Fi, Thriller' 'Comedy, Musical, Family'
 'Comedy, Horror, Romance' 'Thriller, Action' 'Fantasy, Thriller, Action'
 'Fantasy, Romance' 'Action, Drama, Comedy' 'Family, Fantasy, Romance'
 'Comedy, Crime, Horror' 'Horror, Mystery, Sci-Fi'
 'Animation, Action, Drama' 'Family, Mystery'
 'Adventure, Biography, History' 'Fantasy, Horror, Mystery'
 'Family, Musical' 'Drama, Family, Adventure' 'Crime, Horror, Mystery'
 'Documentary, Drama, Fantasy' 'Action, Adventure, Biography'
 'Biography, History, Thriller' 'Action, Family, Drama'
 'Documentary, Drama, Sport' 'Thriller, Mystery' 'Musical, Drama, Comedy'
 'Documentary, History, War' 'Adventure, Horror, Thriller'
 'Action, Adventure, Horror' 'Action, Crime, War'
 'Adventure, Musical, Romance' 'Action, Fantasy, Sci-Fi'
 'Drama, Comedy, Action' 'Documentary, Sport'
 'Documentary, Adventure, Music' 'Drama, Action, Family'
 'Adventure, History, Thriller' 'Adventure, Horror, Romance'
 'Adventure, Crime, Horror' 'Mystery, Musical, Romance'
 'Action, Crime, History' 'Documentary, Musical'
 'Adventure, Fantasy, Musical' 'Documentary, Family, History'
 'Documentary, Drama, Family' 'Drama, Mystery, Sci-Fi'
 'Animation, Drama, Musical' 'Drama, History, Mystery'
 'Drama, Sport, Thriller' 'Action, Crime, Fantasy'
 'Comedy, Musical, Mystery' 'Romance, Musical, Action'
 'Musical, Drama, Fantasy' 'Animation, Family, History'
 'Action, Drama, News' 'Romance, Musical, Comedy'
 'Adventure, Fantasy, Horror' 'Adventure, History'
 'Comedy, Drama, History' 'Mystery, Sci-Fi' 'Action, Thriller, War'
 'Documentary, Drama, News' 'Documentary, Crime, Mystery'
 'Adventure, Horror' 'Animation, Drama, Adventure'
 'Crime, Horror, Romance' 'Documentary, Adventure, Drama'
 'Documentary, Biography, History' 'Fantasy, Horror, Romance'
 'Comedy, Fantasy, Musical' 'Crime, Musical, Thriller' 'Documentary, War'
 'Action, Comedy, War' 'Crime, Drama, Sport' 'Musical, Adventure, Drama'
 'Horror, Romance, Sci-Fi' 'Musical, Mystery, Romance'
 'Romance, Musical, Drama' 'Adventure, Fantasy, Sci-Fi']

 ------------------------------------------------------------ 

Name of Column: Votes
[nan '8' '35' ... '70,344' '408' '1,496']

 ------------------------------------------------------------ 

Name of Column: Director
['J.S. Randhawa' 'Gaurav Bakshi' 'Soumyajit Majumdar' ... 'Mozez Singh'
 'Ved Prakash' 'Kiran Thej']

 ------------------------------------------------------------ 

Name of Column: Actor 1
['Manmauji' 'Rasika Dugal' 'Sayani Gupta' ... 'Meghan Jadhav'
 'Roohi Berde' 'Sangeeta Tiwari']

 ------------------------------------------------------------ 

Name of Column: Actor 2
['Birbal' 'Vivek Ghamande' 'Plabita Borthakur' ... 'Devan Sanjeev'
 'Prince Daniel' 'Sarah Jane Dias']

 ------------------------------------------------------------ 

Name of Column: Actor 3
['Rajendra Bhatia' 'Arvind Jangid' 'Roy Angana' ... 'Shatakshi Gupta'
 'Valerie Agha' 'Suparna Anand']

 ------------------------------------------------------------ 

In [8]:
# Handling the null values
imdb.dropna(subset=['Name', 'Year', 'Duration', 'Rating', 'Votes', 'Director', 'Actor 1', 'Actor 2', 'Actor 3'], inplace=True)

#Extracting only the text part from the Name column
imdb['Name'] = imdb['Name'].str.extract(r"([A-Za-z\s'\-]+)", expand=False)

# Replacing the brackets from year column as observed above
imdb['Year'] = imdb['Year'].str.replace(r'[()]', '', regex=True).astype(int)

# Convert 'Duration' to numeric and replacing the min, while keeping only numerical part
imdb['Duration'] = pd.to_numeric(imdb['Duration'].str.replace(r' min', '', regex=True), errors='coerce')

# Splitting Genre on ', ' so each row holds a single genre, then filling the null values with the mode
imdb['Genre'] = imdb['Genre'].str.split(', ')
imdb = imdb.explode('Genre')
imdb['Genre'] = imdb['Genre'].fillna(imdb['Genre'].mode()[0])

# Convert 'Votes' to numeric and replace the , to keep only numerical part
imdb['Votes'] = pd.to_numeric(imdb['Votes'].str.replace(',', ''), errors='coerce')
In [9]:
#checking duplicate values by Name and Year

duplicate = imdb.groupby(['Name', 'Year']).filter(lambda x: len(x) > 1)
duplicate.head(5)
Out[9]:
Name Year Duration Genre Rating Votes Director Actor 1 Actor 2 Actor 3
3 Yaaram 2019 110 Comedy 4.4 35 Ovais Khan Prateik Ishita Raj Siddhant Kapoor
3 Yaaram 2019 110 Romance 4.4 35 Ovais Khan Prateik Ishita Raj Siddhant Kapoor
5 Aur Pyaar Ho Gaya 1997 147 Comedy 4.7 827 Rahul Rawail Bobby Deol Aishwarya Rai Bachchan Shammi Kapoor
5 Aur Pyaar Ho Gaya 1997 147 Drama 4.7 827 Rahul Rawail Bobby Deol Aishwarya Rai Bachchan Shammi Kapoor
5 Aur Pyaar Ho Gaya 1997 147 Musical 4.7 827 Rahul Rawail Bobby Deol Aishwarya Rai Bachchan Shammi Kapoor
In [10]:
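# Dropping every row whose Name occurs more than once (keep=False keeps no copy,
# so multi-genre movies duplicated by explode are removed as well)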
imdb = imdb.drop_duplicates(subset=['Name'], keep=False)

📊 Insights:

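  • Every column except Name had null values
  • The Name, Year, Duration, Genre, and Votes columns contained formatting artifacts (brackets around the year, the ' min' suffix, comma-separated genres, thousands separators) that were cleaned
  • Rows with nulls in any column other than Genre were dropped; missing Genre values were filled with the mode
  • Finally, we checked for duplicates and dropped every movie whose Name appeared more than once. Because Genre had been exploded into one row per genre, this also removed every movie with more than one genre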
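
The cleaning steps summarised above can be reproduced on a tiny frame to see what each one does. This is a minimal sketch on hypothetical data (the `sample` frame and its three movies are made up for illustration, not taken from the dataset); it mirrors the notebook's logic rather than replacing it.

```python
import pandas as pd

# Hypothetical three-row sample that mimics the raw column formats
sample = pd.DataFrame({
    "Name": ["Movie A", "Movie B", "Movie C"],
    "Year": ["(2019)", "(1997)", "(2005)"],
    "Duration": ["110 min", "147 min", "90 min"],
    "Genre": ["Comedy, Romance", "Drama", None],
    "Votes": ["35", "1,496", "8"],
})

# Strip the formatting artifacts and convert to numbers
sample["Year"] = sample["Year"].str.replace(r"[()]", "", regex=True).astype(int)
sample["Duration"] = pd.to_numeric(sample["Duration"].str.replace(" min", "", regex=False))
sample["Votes"] = pd.to_numeric(sample["Votes"].str.replace(",", "", regex=False))

# One row per (movie, genre); missing genres are filled with the mode
exploded = sample.assign(Genre=sample["Genre"].str.split(", ")).explode("Genre", ignore_index=True)
exploded["Genre"] = exploded["Genre"].fillna(exploded["Genre"].mode()[0])

# keep=False drops every row whose Name repeats, i.e. all multi-genre movies
single_genre = exploded.drop_duplicates(subset=["Name"], keep=False)
print(single_genre[["Name", "Year", "Duration", "Genre", "Votes"]])
```

In this toy example "Movie A" disappears entirely because it has two genres. If the intent were to keep one row per movie instead, `keep='first'` would be the option to look at.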

📊 Exploratory Data Analysis 📊


Descriptive Analysis


In [11]:
imdb.describe()
Out[11]:
Year Duration Rating Votes
count 1528.000000 1528.000000 1528.000000 1528.000000
mean 1997.972513 123.823953 5.976243 552.479712
std 21.181921 25.108144 1.412547 4311.631841
min 1931.000000 45.000000 1.600000 5.000000
25% 1985.000000 107.000000 5.100000 14.000000
50% 2004.000000 126.000000 6.100000 34.000000
75% 2016.000000 140.000000 7.000000 127.250000
max 2021.000000 300.000000 9.400000 101014.000000
In [12]:
imdb.describe(include = 'O')
Out[12]:
Name Genre Director Actor 1 Actor 2 Actor 3
count 1528 1528 1528 1528 1528 1528
unique 1528 20 1114 1010 1131 1154
top Gadhvi Drama Kanti Shah Mithun Chakraborty Mithun Chakraborty Pran
freq 1 789 13 22 12 16
In [13]:
# Find the row with the highest number of votes
max_votes_row = imdb[imdb['Votes'] == imdb['Votes'].max()]

# Get the name of the movie with the highest votes
movie_highest_votes = max_votes_row['Name'].values[0]

# Find the number of votes for the movie with the highest votes
votes_highest_votes = max_votes_row['Votes'].values[0]

print("Movie with the highest votes:", movie_highest_votes)
print("Number of votes for the movie with the highest votes:", votes_highest_votes)
print('\n', '='*100, '\n')


# Find the row with the lowest number of votes
min_votes_row = imdb[imdb['Votes'] == imdb['Votes'].min()]

# Get the name of the movie with the lowest votes
movie_lowest_votes = min_votes_row['Name'].values[0]

# Find the number of votes for the movie with the lowest votes
votes_lowest_votes = min_votes_row['Votes'].values[0]

print("Movie with the lowest votes:", movie_lowest_votes)
print("Number of votes for the movie with the lowest votes:", votes_lowest_votes)
Movie with the highest votes: My Name Is Khan
Number of votes for the movie with the highest votes: 101014

 ==================================================================================================== 

Movie with the lowest votes: Anmol Sitaare
Number of votes for the movie with the lowest votes: 5
In [14]:
# Find the row with the highest rating
max_rating_row = imdb[imdb['Rating'] == imdb['Rating'].max()]
movie_highest_rating = max_rating_row['Name'].values[0]
votes_highest_rating = max_rating_row['Votes'].values[0]

print("Movie with the highest rating:", movie_highest_rating)
print("Number of votes for the movie with the highest rating:", votes_highest_rating)
print('\n', '='*100, '\n')


# Find the row with the lowest rating
min_rating_row = imdb[imdb['Rating'] == imdb['Rating'].min()]
movie_lowest_rating = min_rating_row['Name'].values[0]
votes_lowest_rating = min_rating_row['Votes'].values[0]

print("Movie with the lowest rating:", movie_lowest_rating)
print("Number of votes for the movie with the lowest rating:", votes_lowest_rating)
Movie with the highest rating: June
Number of votes for the movie with the highest rating: 18

 ==================================================================================================== 

Movie with the lowest rating: Mumbai Can Dance Saalaa
Number of votes for the movie with the lowest rating: 43
In [15]:
# Group the dataset by the 'Director' column and count the number of movies each director has directed
director_counts = imdb['Director'].value_counts()

# Find the director with the highest number of movies directed
most_prolific_director = director_counts.idxmax()
num_movies_directed = director_counts.max()

print("Director with the most movies directed:", most_prolific_director)
print("Number of movies directed by", most_prolific_director, ":", num_movies_directed)
print('\n', '='*100, '\n')


# Group the dataset by the 'Director' column and count the number of movies each director has directed
director_counts = imdb['Director'].value_counts()

# Find the director with the lowest number of movies directed
least_prolific_director = director_counts.idxmin()
num_movies_directed = director_counts.min()

print("Director with the fewest movies directed:", least_prolific_director)
print("Number of movies directed by", least_prolific_director, ":", num_movies_directed)
Director with the most movies directed: Kanti Shah
Number of movies directed by Kanti Shah : 13

 ==================================================================================================== 

Director with the fewest movies directed: Sikandar Khanna
Number of movies directed by Sikandar Khanna : 1

📊 Insights:

  • The earliest release year in the cleaned dataset is 1931, and the shortest movie runs just 45 minutes
  • Drama is the most common genre (789 of 1,528 movies), and Mithun Chakraborty is the most frequent lead actor (Actor 1) with 22 movies
  • The cells above also identify the movies with the highest and lowest vote counts and ratings
  • The directors with the most and the fewest movies are listed above as well
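
The highest/lowest lookups above repeat the same pattern several times. A more compact alternative, sketched here on a hypothetical `movies` frame (the names and numbers are invented for illustration), is to let `idxmax` and `idxmin` return the row labels directly. The `extremes` helper is an assumption of this sketch, not part of the notebook.

```python
import pandas as pd

# Hypothetical sample; the notebook itself would use the cleaned `imdb` frame
movies = pd.DataFrame({
    "Name": ["Film X", "Film Y", "Film Z"],
    "Rating": [7.2, 4.1, 8.5],
    "Votes": [120, 15, 48],
    "Director": ["Dir 1", "Dir 2", "Dir 1"],
})

def extremes(df, column):
    """Return the (highest, lowest) rows of `df` by `column`."""
    return df.loc[df[column].idxmax()], df.loc[df[column].idxmin()]

most_voted, least_voted = extremes(movies, "Votes")
print("Movie with the highest votes:", most_voted["Name"], most_voted["Votes"])
print("Movie with the lowest votes:", least_voted["Name"], least_voted["Votes"])

# value_counts gives movies per director; idxmax picks the most frequent one
director_counts = movies["Director"].value_counts()
print("Most prolific director:", director_counts.idxmax(), director_counts.max())
```

Note that `idxmax` and `idxmin` return only the first match, so ties (for example, the many directors with a single movie) are not all reported.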

Univariate Analysis


In [16]:
fig_year = px.histogram(imdb, x = 'Year', histnorm='probability density', nbins = 30, color_discrete_sequence = colors)
fig_year.update_traces(selector=dict(type='histogram'))
fig_year.update_layout(title='Distribution of Year', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Year', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_year.show()
In [17]:
fig_duration = px.histogram(imdb, x = 'Duration', histnorm='probability density', nbins = 40, color_discrete_sequence = colors)
fig_duration.update_traces(selector=dict(type='histogram'))
fig_duration.update_layout(title='Distribution of Duration', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Duration', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_duration.show()
In [18]:
fig_rating = px.histogram(imdb, x = 'Rating', histnorm='probability density', nbins = 40, color_discrete_sequence = colors)
fig_rating.update_traces(selector=dict(type='histogram'))
fig_rating.update_layout(title='Distribution of Rating', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Rating', yaxis_title='Probability Density', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), bargap=0.02, plot_bgcolor = 'white')
fig_rating.show()
In [19]:
fig_votes = px.box(imdb, x = 'Votes', color_discrete_sequence = colors)
fig_votes.update_layout(title='Distribution of Votes', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Votes', yaxis_title='', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig_votes.show()

📊 Insights:

  • The distribution of Year is left-skewed, with a high concentration of movies released between 2015 and 2019

  • The duration of movies is roughly Gaussian, with very few outliers

  • Rating is also roughly Gaussian, with a high concentration around 6.6 and 6.7

  • The number of votes has plenty of outliers
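
Skewness and outliers can also be checked numerically rather than read off the plots. The sketch below uses a hypothetical vote series (the values are illustrative only) and the usual 1.5 × IQR box-plot rule; `iqr_outlier_count` is a helper introduced here, not something defined in the notebook.

```python
import pandas as pd

def iqr_outlier_count(series):
    """Count values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
    return int(mask.sum())

# Hypothetical vote counts with one very popular title
votes = pd.Series([5, 8, 14, 20, 34, 50, 127, 101014], name="Votes")

# Positive skewness means a long right tail, negative a long left tail
print("Skewness:", votes.skew())
print("Outliers by the 1.5*IQR rule:", iqr_outlier_count(votes))
```

Applied to the real columns, the same two calls would give a number to back each bullet, for instance a negative skewness for Year if it is indeed left-skewed.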


Bivariate Analysis


In [20]:
year_avg_rating = imdb.groupby('Year')['Rating'].mean().reset_index()

top_10_years = year_avg_rating.nlargest(10, 'Rating')
fig = px.bar(top_10_years, x='Year', y='Rating', title='Top 10 Years by Average Rating', color = "Rating", color_continuous_scale = "darkmint")
fig.update_xaxes(type='category')
fig.update_layout(xaxis_title='Year', yaxis_title='Average Rating', plot_bgcolor = 'white')
fig.show()
In [21]:
# Group data by Year and calculate the average rating
average_rating_by_year = imdb.groupby('Year')['Rating'].mean().reset_index()

# Create the line plot with Plotly Express
fig = px.line(average_rating_by_year, x='Year', y='Rating', color_discrete_sequence=['#559C9E'])
fig.update_layout(title='Are there any trends in ratings across year?', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Year', yaxis_title='Rating', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig.show()
In [22]:
# Group data by Year and calculate the average number of votes
average_votes_by_year = imdb.groupby('Year')['Votes'].mean().reset_index()

# Create the line plot with Plotly Express
fig = px.line(average_votes_by_year, x='Year', y='Votes', color_discrete_sequence=['#559C9E'])
fig.update_layout(title='Are there any trends in votes across year?', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Year', yaxis_title='Votes', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig.show()
In [23]:
# Group data by Year and Genre and calculate the average rating
average_rating_by_year = imdb.groupby(['Year', 'Genre'])['Rating'].mean().reset_index()

# Get the top 3 genres
top_3_genres = imdb['Genre'].value_counts().head(3).index

# Filter the data to include only the top 3 genres
average_rating_by_year = average_rating_by_year[average_rating_by_year['Genre'].isin(top_3_genres)]

# Create the line plot with Plotly Express
fig = px.line(average_rating_by_year, x='Year', y='Rating', color = "Genre", color_discrete_sequence=['#559C9E', '#0B1F26', '#00CC96'])

# Customize the layout
fig.update_layout(title='Average Rating by Year for Top 3 Genres', xaxis_title='Year', yaxis_title='Average Rating', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor='white')

# Show the plot
fig.show()
In [24]:
fig_dur_rat = px.scatter(imdb, x = 'Duration', y = 'Rating', trendline='ols', color = "Rating", color_continuous_scale = "darkmint")
fig_dur_rat.update_layout(title='Does length of movie have any impact on rating?', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Duration of Movie in Minutes', yaxis_title='Rating of a movie', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig_dur_rat.show()
In [25]:
fig_dur_votes = px.scatter(imdb, x = 'Duration', y = 'Votes', trendline='ols', color = "Votes", color_continuous_scale = "darkmint")
fig_dur_votes.update_layout(title='Does length of movie have any impact on Votes?', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Duration of Movie in Minutes', yaxis_title='Votes of a movie', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig_dur_votes.show()
In [26]:
fig_rat_votes = px.scatter(imdb, x = 'Rating', y = 'Votes', trendline='ols', color = "Votes", color_continuous_scale = "darkmint")
fig_rat_votes.update_layout(title='Does Ratings of movie have any impact on Votes?', title_x=0.5, title_pad=dict(t=20), title_font=dict(size=20), xaxis_title='Ratings of Movies', yaxis_title='Votes of movies', xaxis=dict(showgrid=False), yaxis=dict(showgrid=False), plot_bgcolor = 'white')
fig_rat_votes.show()

📊 Insights:

  • The year with the highest average rating is 1944.
  • We can also see that from 1984, there is a downward trend for ratings until 1993, and the **Fare **
  • Then after 2013, there is a conntinous upward trend for movies rating
  • Movies released in 2010 have the highest average number of votes.
  • In terms of genre, Drama has had the highest rating since its start.
  • In this dataset, the Comedy genre first appears in 1953 and the Action genre in 1964.
  • Shorter movies seem to get higher ratings and more votes, which may suggest that audiences lose interest in longer films.
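
The duration observation above is read off the scatter plots. A rank correlation is one way to put a number on it. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) and computes Spearman correlation as the Pearson correlation of the ranks, which is less sensitive to the long right tail that vote counts usually have.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned `imdb` frame; values are made up.
sample = pd.DataFrame({
    "Duration": [90, 105, 120, 135, 150, 165],
    "Rating":   [7.1, 6.8, 6.5, 6.0, 5.9, 5.4],
    "Votes":    [850, 40, 1200, 15, 300, 95],
})

# Spearman correlation is the Pearson correlation of the ranked values.
rank_corr = sample.rank().corr(method="pearson")

duration_vs_rating = rank_corr.loc["Duration", "Rating"]
duration_vs_votes = rank_corr.loc["Duration", "Votes"]
print(rank_corr.round(2))
```

On the real data, the same call would be applied to `imdb[['Duration', 'Rating', 'Votes']]`. A value near zero would suggest that the visual impression is weaker than the insight implies.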

⚙️ Data Preprocessing ⚙️

In [27]:
# Dropping non-essential columns
imdb.drop('Name', axis = 1, inplace = True)

1. Feature Engineering


In [28]:
# Grouping the columns with their average rating and then creating a new feature

genre_mean_rating = imdb.groupby('Genre')['Rating'].transform('mean')
imdb['Genre_mean_rating'] = genre_mean_rating

director_mean_rating = imdb.groupby('Director')['Rating'].transform('mean')
imdb['Director_encoded'] = director_mean_rating

actor1_mean_rating = imdb.groupby('Actor 1')['Rating'].transform('mean')
imdb['Actor1_encoded'] = actor1_mean_rating

actor2_mean_rating = imdb.groupby('Actor 2')['Rating'].transform('mean')
imdb['Actor2_encoded'] = actor2_mean_rating

actor3_mean_rating = imdb.groupby('Actor 3')['Rating'].transform('mean')
imdb['Actor3_encoded'] = actor3_mean_rating
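
One caveat with the group means above: they are computed from `Rating` over every row, including rows that later end up in the test set, so the encoded features carry information about the target into evaluation. A minimal sketch of a leak-free variant follows, assuming the split is done first and the means are learned from the training rows only. The toy frame and the names `toy`, `director_means`, and `global_mean` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for `imdb`; values are illustrative only.
toy = pd.DataFrame({
    "Director": ["A", "A", "B", "B", "C", "C", "D", "A"],
    "Rating":   [7.0, 6.0, 5.0, 5.5, 8.0, 7.5, 4.0, 6.5],
})

# Split first, so the test rows never influence the encoding.
train, test = train_test_split(toy, test_size=0.25, random_state=42)
train, test = train.copy(), test.copy()

# Learn the per-director mean rating from the training rows only.
director_means = train.groupby("Director")["Rating"].mean()
global_mean = train["Rating"].mean()

# Apply the learned mapping to both parts.
# Directors not seen in training fall back to the training-set global mean.
train["Director_encoded"] = train["Director"].map(director_means)
test["Director_encoded"] = test["Director"].map(director_means).fillna(global_mean)

print(test[["Director", "Rating", "Director_encoded"]])
```

The same pattern would apply to `Genre` and the three actor columns. Scores obtained this way are likely to be lower than those from encoding on the full dataset, but they are a fairer estimate of performance on new movies.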

2. Splitting into training and testing


In [29]:
# Keeping the predictor and target variable

X = imdb[[ 'Year', 'Votes', 'Duration', 'Genre_mean_rating','Director_encoded','Actor1_encoded', 'Actor2_encoded', 'Actor3_encoded']]
y = imdb['Rating']
In [30]:
# Splitting the dataset into training and testing parts

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

🎯 Model Building 🎯

In [31]:
# Building 2 machine learning models and training them

lr = LinearRegression()
lr.fit(X_train,y_train)
lr_pred = lr.predict(X_test)


rf = RandomForestRegressor()
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)
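
A single train/test split gives one estimate of model quality. The notebook imports `cross_val_score` but does not use it, so the sketch below shows how it could give a per-fold view. It runs on synthetic data (an assumption, standing in for `X_train` and `y_train`) and fixes `random_state` so the forest is reproducible.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for X_train / y_train (assumption).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# A fixed random_state makes repeated runs give the same forest.
rf_demo = RandomForestRegressor(n_estimators=50, random_state=42)

# 5-fold cross-validation, scored with R2 on each held-out fold.
scores = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring="r2")
print("R2 per fold:", scores.round(3))
print("Mean R2:", round(float(scores.mean()), 3))
```

A large spread between folds would indicate that the single-split score depends heavily on which rows happened to land in the test set.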

⚡ Model Evaluation ⚡

In [32]:
# Evaluating the performance of the trained models

print('The performance evaluation of Linear Regression is below: ', '\n')
print('Mean squared error: ',mean_squared_error(y_test, lr_pred))
print('Mean absolute error: ',mean_absolute_error(y_test, lr_pred))
print('R2 score: ',r2_score(y_test, lr_pred))
print('\n', '='*100, '\n')

print('The performance evaluation of Random Forest Regressor is below: ', '\n')
print('Mean squared error: ',mean_squared_error(y_test, rf_pred))
print('Mean absolute error: ',mean_absolute_error(y_test, rf_pred))
print('R2 score: ',r2_score(y_test, rf_pred))
The performance evaluation of Linear Regression is below:  

Mean squared error:  0.13007622782536266
Mean absolute error:  0.2507994097724827
R2 score:  0.935188545523222

 ==================================================================================================== 

The performance evaluation of Random Forest Regressor is below:  

Mean squared error:  0.11293861764705879
Mean absolute error:  0.18987908496732034
R2 score:  0.9437274881146624

📊 Insights:

  • Random Forest outperformed Linear Regression on all three metrics, with an R2 score of 0.944 versus 0.935.
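
Mean squared error is expressed in squared rating points, which is hard to interpret directly. Taking the square root gives the RMSE in the same units as the rating itself. The short sketch below does this arithmetic for the two MSE values printed in the evaluation output; the dictionary names are only for illustration.

```python
import math

# MSE values copied from the evaluation output of the two models.
mse = {
    "Linear Regression": 0.13007622782536266,
    "Random Forest": 0.11293861764705879,
}

# RMSE is the square root of MSE and is measured in rating points.
rmse = {name: math.sqrt(value) for name, value in mse.items()}

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.3f}")
```

Both models come out at roughly a third of a rating point, so the gap between them is small in absolute terms.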

🔎 Model Testing 🔎

In [33]:
# Checking a sample of the predictor values on which the model is trained
X.head()
Out[33]:
Year Votes Duration Genre_mean_rating Director_encoded Actor1_encoded Actor2_encoded Actor3_encoded
1 2019 8 109 6.420152 7.000 6.850000 7.000000 7.000
10 2004 17 96 6.420152 6.200 5.766667 5.100000 6.200
11 2016 59 120 4.698529 5.900 5.900000 5.900000 5.900
30 2005 1002 116 6.420152 6.525 6.900000 6.866667 5.700
32 1993 15 168 6.420152 5.400 5.600000 6.400000 5.825
In [34]:
# Checking the rating according to above predictor variables
y.head()
Out[34]:
1     7.0
10    6.2
11    5.9
30    7.1
32    5.6
Name: Rating, dtype: float64
In [35]:
# Creating a new dataframe with values close to the 3rd row according to the sample above

data = {'Year': [2016], 'Votes': [58], 'Duration': [121], 'Genre_mean_rating': [4.5], 'Director_encoded': [5.8], 'Actor1_encoded': [5.9], 'Actor2_encoded': [5.9], 'Actor3_encoded': [5.900]}
df = pd.DataFrame(data)
In [36]:
# Predict the movie rating
predicted_rating = rf.predict(df)

# Display the predicted rating
print("Predicted Rating:", predicted_rating[0])
Predicted Rating: 5.849999999999995

📊 Insights:

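  • We took the values from X.head() and built a new dataframe that is almost identical to the third row, whose original rating was 5.9. Our trained random forest regressor predicted 5.85. Because this input is nearly a copy of an existing row, which may have been part of the training split, the close match is a sanity check rather than proof of robustness on unseen data.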
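
A hand-built row that closely mirrors an existing one says little about generalization. One alternative is to place actual and predicted values side by side for rows from the held-out split. The sketch below does this on synthetic data; the frame, the feature names `f1` to `f3`, and the `comparison` table are assumptions standing in for the notebook's `X_test`, `y_test`, and `rf`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's X and y (assumption).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 10, size=(300, 3)), columns=["f1", "f2", "f3"])
y_demo = 0.5 * X_demo["f1"] + 0.3 * X_demo["f2"] + rng.normal(scale=0.2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Actual vs. predicted for rows the model never saw during training.
comparison = pd.DataFrame(
    {"actual": y_te, "predicted": model.predict(X_te)},
    index=X_te.index,
)
comparison["abs_error"] = (comparison["actual"] - comparison["predicted"]).abs()

print(comparison.head())
print("Mean absolute error on held-out rows:", round(comparison["abs_error"].mean(), 3))
```

In the notebook itself, the equivalent would be comparing `y_test` with `rf_pred`, which are already available after the model-building cell.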

🎈 Conclusion 🎈

Conclusion:


In this Jupyter notebook project, we embarked on a journey to analyze and predict movie ratings. We encountered a variety of data challenges, such as missing values, typos in column names, and duplicated records. Through a series of data cleaning and preprocessing steps, we were able to prepare the dataset for analysis. Our analysis uncovered several interesting insights about the movie dataset. We observed trends in movie durations, genre popularity, the most prolific actors and directors, and the distribution of movie ratings and votes over the years. Notably, we found that short-duration movies tend to receive higher ratings and votes, and the Drama genre has consistently performed well in terms of ratings. Furthermore, our evaluation of machine learning models revealed that Random Forest outperformed Linear Regression, with an impressive R-squared score of 0.94 on unseen data, highlighting the model's robustness.


Insights:

Following are the main insights:

  • Data cleaning was essential, involving the correction of typos and handling missing/duplicated values.

  • We explored the temporal dimension of the data, noting the first entry in 1931 and a movie with just 45 minutes of duration.

  • Mithun is the most frequently appearing lead actor.

  • We identified both the best and worst-performing movies in terms of votes and ratings.

  • Insights on directors with the most and least movies were gained.

  • The distribution of movies over the years is skewed, with a concentration in the 2015-2019 period.

  • In 2010, some movies had the highest average votes.

  • Short-duration movies tend to receive higher ratings and votes, indicating a potential preference for shorter films.

  • Drama is a consistently popular genre, while Comedy and Action genres had their origins in 1953 and 1964, respectively.

  • The distribution of ratings and votes follows Gaussian-like patterns, with specific peaks and trends over time.

  • Random Forest regression outperformed Linear Regression with an R-squared score of 0.94, indicating its robustness.

Through our analysis, we gained a deep understanding of the dataset and its trends. This knowledge can be leveraged to make informed decisions regarding movie production, genres, and more. Future work could involve building more advanced machine learning models or diving deeper into specific genres or time periods to uncover additional insights.


What's next?

To further enhance this project, one can consider the following:

  • Explore additional machine learning models and fine-tune their parameters to improve rating prediction accuracy.

  • Investigate the relationships between various features, such as the impact of specific actors, directors, or genres on movie ratings.

  • Conduct sentiment analysis on movie reviews or incorporate external data sources to gain more insights into factors influencing movie ratings.

  • Create visualizations and dashboards to make the insights more accessible and engaging for stakeholders.

  • Continue updating the dataset with new movie releases and ratings for ongoing analysis and predictions.

